In the introduction, we mentioned that for the exploratory component, where we pose our own question and answer it through the data science cycle, we chose to tackle two different tasks. The first is extending the dataset with weather data and checking whether it improves our models' performance. The second, and more challenging, is predicting not the overall pickups but the rides per cluster, after clustering Boston's stations.
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import main as main
from tabulate import tabulate
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import r2_score, mean_squared_error
data = pd.read_csv('BikeSharing_Bluebikes2022.csv', index_col=0)
data.head()
| tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | postal code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 597 | 2022-01-01 00:00:25.1660 | 2022-01-01 00:10:22.1920 | 178 | MIT Pacific St at Purrington St | 42.359573 | -71.101295 | 74 | Harvard Square at Mass Ave/ Dunster | 42.373268 | -71.118579 | 4923 | Subscriber | 02139 |
| 1 | 411 | 2022-01-01 00:00:40.4300 | 2022-01-01 00:07:32.1980 | 189 | Kendall T | 42.362428 | -71.084955 | 178 | MIT Pacific St at Purrington St | 42.359573 | -71.101295 | 3112 | Subscriber | 02139 |
| 2 | 476 | 2022-01-01 00:00:54.8180 | 2022-01-01 00:08:51.6680 | 94 | Main St at Austin St | 42.375603 | -71.064608 | 356 | Charlestown Navy Yard | 42.374125 | -71.054812 | 6901 | Customer | 02124 |
| 3 | 466 | 2022-01-01 00:01:01.6080 | 2022-01-01 00:08:48.2350 | 94 | Main St at Austin St | 42.375603 | -71.064608 | 356 | Charlestown Navy Yard | 42.374125 | -71.054812 | 5214 | Customer | 02124 |
| 4 | 752 | 2022-01-01 00:01:06.0520 | 2022-01-01 00:13:38.2300 | 19 | Park Dr at Buswell St | 42.347241 | -71.105301 | 41 | Packard's Corner - Commonwealth Ave at Brighto... | 42.352261 | -71.123831 | 2214 | Subscriber | 02215 |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2305735 entries, 0 to 487200
Data columns (total 14 columns):
 #   Column                   Dtype
---  ------                   -----
 0   tripduration             int64
 1   starttime                object
 2   stoptime                 object
 3   start station id         int64
 4   start station name       object
 5   start station latitude   float64
 6   start station longitude  float64
 7   end station id           int64
 8   end station name         object
 9   end station latitude     float64
 10  end station longitude    float64
 11  bikeid                   int64
 12  usertype                 object
 13  postal code              object
dtypes: float64(4), int64(4), object(6)
memory usage: 263.9+ MB
#transforming the dates from object to datetime
for date_column in ['starttime','stoptime']:
data[date_column] = pd.to_datetime(data[date_column], format='%Y-%m-%d %H:%M:%S.%f')
'''
Again, with the DatetimeInterval function we set the datetime as the index and calculate the pickups
during the intervals, as before.
'''
df15 = main.DatetimeInterval(data, freq='15Min')
df30 = main.DatetimeInterval(data, freq='30Min')
df60 = main.DatetimeInterval(data, freq='60Min')
df120 = main.DatetimeInterval(data, freq='120Min')
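`DatetimeInterval` is a helper from the project's `main` module; a minimal sketch of equivalent logic with pandas `resample`, under the assumption that it simply counts trip starts per interval:

```python
import pandas as pd

def datetime_interval(df, freq):
    # Count pickups (trip starts) per interval of the given frequency.
    return (df.set_index('starttime')
              .resample(freq)
              .size()
              .rename('pickups')
              .to_frame())

trips = pd.DataFrame({'starttime': pd.to_datetime(
    ['2022-01-01 00:05', '2022-01-01 00:20', '2022-01-01 00:40'])})
counts = datetime_interval(trips, '30Min')  # two pickups in the first bin, one in the second
```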
These weather data were downloaded from https://www.visualcrossing.com/weather-data using the Easy Global Weather API. They are hourly observations from Boston weather stations, also covering 01/01/2022 to 31/08/2022. Since they contain many different columns, they will definitely need to be preprocessed.
Let's start then!
weather_data = pd.read_csv('Data/WeatherData', index_col=0)
weather_data.head()
| name | datetime | temp | feelslike | dew | humidity | precip | precipprob | preciptype | snow | ... | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | conditions | icon | stations | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston | 2022-01-01T00:00:00 | 7.8 | 7.8 | 6.7 | 92.50 | 0.0 | 0 | NaN | 0.0 | ... | 1014.5 | 100.0 | 8.0 | NaN | NaN | NaN | NaN | Overcast | cloudy | KOWD,72509854704,KBED,KBOS,72509014739,7250591... |
| 1 | Boston | 2022-01-01T01:00:00 | 7.2 | 6.5 | 6.7 | 96.49 | 0.0 | 0 | NaN | 0.0 | ... | 1014.1 | 100.0 | 5.1 | NaN | NaN | NaN | NaN | Overcast | cloudy | KOWD,72509854704,KBED,KBOS,72509014739,7250591... |
| 2 | Boston | 2022-01-01T02:00:00 | 7.2 | 6.0 | 6.7 | 96.49 | 0.0 | 0 | NaN | 0.0 | ... | 1014.2 | 100.0 | 4.0 | NaN | NaN | NaN | NaN | Overcast | cloudy | KOWD,72509854704,KBED,KBOS,72509014739,7250591... |
| 3 | Boston | 2022-01-01T03:00:00 | 7.2 | 7.2 | 6.7 | 96.60 | 0.0 | 0 | NaN | 0.0 | ... | 1014.1 | 100.0 | 1.0 | NaN | NaN | NaN | NaN | Overcast | cloudy | KOWD,72509854704,KBED,KBOS,72509014739,7250591... |
| 4 | Boston | 2022-01-01T04:00:00 | 6.8 | 5.4 | 6.7 | 99.79 | 0.0 | 0 | NaN | 0.0 | ... | 1013.6 | 100.0 | 0.0 | NaN | NaN | NaN | NaN | Overcast | cloudy | KOWD,72509854704,KBED,KBOS,72509014739,7250591... |
5 rows × 24 columns
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'], format='%Y-%m-%dT%H:%M:%S')
weather_data.drop('name',axis=1, inplace=True)
weather_data.set_index('datetime', inplace=True)
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5831 entries, 2022-01-01 00:00:00 to 2022-08-31 23:00:00
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   temp              5831 non-null   float64
 1   feelslike         5831 non-null   float64
 2   dew               5831 non-null   float64
 3   humidity          5831 non-null   float64
 4   precip            5831 non-null   float64
 5   precipprob        5831 non-null   int64
 6   preciptype        547 non-null    object
 7   snow              5831 non-null   float64
 8   snowdepth         5831 non-null   float64
 9   windgust          5675 non-null   float64
 10  windspeed         5831 non-null   float64
 11  winddir           5831 non-null   float64
 12  sealevelpressure  5831 non-null   float64
 13  cloudcover        5831 non-null   float64
 14  visibility        5831 non-null   float64
 15  solarradiation    5691 non-null   float64
 16  solarenergy       3284 non-null   float64
 17  uvindex           5691 non-null   float64
 18  severerisk        5601 non-null   float64
 19  conditions        5831 non-null   object
 20  icon              5831 non-null   object
 21  stations          5831 non-null   object
dtypes: float64(17), int64(1), object(4)
memory usage: 1.0+ MB
weather_data.shape  # the same shape as our hourly grouped data
(5831, 22)
At first glance, we drop the columns we believe are not helpful, and we will process the weather columns that seem useful instead. Later we will probably drop a few more of them, based on what we learn during cleaning and analysis.
weather_data.drop(['dew','windgust','precipprob','winddir','sealevelpressure','solarenergy','cloudcover','icon','stations'], axis=1, inplace=True)
weather_data.columns
Index(['temp', 'feelslike', 'humidity', 'precip', 'preciptype', 'snow',
'snowdepth', 'windspeed', 'visibility', 'solarradiation', 'uvindex',
'severerisk', 'conditions'],
dtype='object')
We check for NaN values, and then we inspect each column one by one.
weather_data.isna().sum()
temp                 0
feelslike            0
humidity             0
precip               0
preciptype        5284
snow                 0
snowdepth            0
windspeed            0
visibility           0
solarradiation     140
uvindex            140
severerisk         230
conditions           0
dtype: int64
Temp
weather_data['temp'].describe()
count    5831.000000
mean       12.352478
std        11.056270
min       -15.600000
25%         4.300000
50%        12.900000
75%        21.600000
max        37.200000
Name: temp, dtype: float64
sns.histplot(data=weather_data, x="temp")
<AxesSubplot:xlabel='temp', ylabel='Count'>
feelslike
weather_data['feelslike'].describe()
count    5831.000000
mean       10.596210
std        13.410548
min       -26.500000
25%         0.800000
50%        12.900000
75%        21.600000
max        39.500000
Name: feelslike, dtype: float64
sns.histplot(data=weather_data, x="feelslike")
<AxesSubplot:xlabel='feelslike', ylabel='Count'>
The two look very similar except at the upper and lower bounds, which is quite reasonable: in extreme temperatures the perceived temperature is even more extreme than the observed value.
plt.figure(figsize=(10,6))
plt.scatter(x='temp',y='feelslike', data=weather_data)
<matplotlib.collections.PathCollection at 0x7fde28b6e5b0>
Humidity
weather_data['humidity'].describe()
count    5831.000000
mean       62.503353
std        20.229678
min        15.130000
25%        46.120000
50%        61.270000
75%        79.645000
max        99.940000
Name: humidity, dtype: float64
sns.histplot(data=weather_data, x="humidity")
<AxesSubplot:xlabel='humidity', ylabel='Count'>
precip
weather_data['precip'].describe()
count    5831.000000
mean        0.077047
std         0.451573
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max        10.546000
Name: precip, dtype: float64
plt.figure(figsize=(10,8))
sns.histplot(data=weather_data, x="precip", hue="preciptype")
<AxesSubplot:xlabel='precip', ylabel='Count'>
plt.figure(figsize=(10,8))
sns.histplot(data=weather_data, x="precip", hue="conditions")
<AxesSubplot:xlabel='precip', ylabel='Count'>
weather_data[weather_data['precip']>0]['conditions'].unique()
array(['Rain, Overcast', 'Snow, Rain, Overcast', 'Snow, Overcast',
'Rain, Partially cloudy', 'Snow, Ice, Overcast',
'Snow, Partially cloudy'], dtype=object)
Preciptype
weather_data['preciptype'].nunique()
5
weather_data['preciptype'].unique()
array([nan, 'rain', 'rain,snow', 'snow', 'freezingrain', 'snow,ice'],
dtype=object)
weather_data[weather_data['preciptype']=='snow'][['snow','snowdepth']]
| snow | snowdepth | |
|---|---|---|
| datetime | ||
| 2022-01-07 05:00:00 | 1.04 | 21.88 |
| 2022-01-24 00:00:00 | 0.00 | 0.10 |
| 2022-01-25 00:00:00 | 0.13 | 2.00 |
| 2022-01-25 01:00:00 | 0.13 | 2.13 |
| 2022-01-29 02:00:00 | 0.42 | 7.50 |
| ... | ... | ... |
| 2022-03-03 03:00:00 | 0.70 | 13.42 |
| 2022-03-03 04:00:00 | 1.10 | 13.33 |
| 2022-03-03 05:00:00 | 0.40 | 13.25 |
| 2022-03-12 20:00:00 | 1.30 | 1.70 |
| 2022-03-28 08:00:00 | 0.00 | 0.00 |
73 rows × 2 columns
"""
applying a map in the preciptype column to transform it in categorical with Label Encoding.
We group-label them according to the type of precipitation
"""
weather_data['enc_preciptype'] = weather_data['preciptype'].map(lambda row: main.PreciptypeMap(row)).astype('int32')
weather_data.drop('preciptype',axis=1, inplace=True)
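`PreciptypeMap` lives in the project's `main` module; a hypothetical re-implementation consistent with the encoding legend given later in the notebook might look like this:

```python
import math

# Hypothetical sketch of main.PreciptypeMap: label-encode the precipitation
# type, treating NaN (no precipitation) as its own class 1.
def preciptype_map(value):
    if isinstance(value, float) and math.isnan(value):
        return 1  # no precipitation
    return {'rain': 2, 'rain,snow': 3, 'snow': 4,
            'freezingrain': 5, 'snow,ice': 5}[value]
```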
# catplot creates its own figure, so plt.figure() is not needed here
sns.catplot(data=weather_data, x="enc_preciptype", height=8)
<seaborn.axisgrid.FacetGrid at 0x7fde285a1b80>
snow
weather_data['snow'].describe()
count    5831.000000
mean        0.019317
std         0.142421
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.600000
Name: snow, dtype: float64
weather_data[weather_data['snow']>2]['conditions'].unique()
array(['Partially cloudy'], dtype=object)
weather_data[weather_data['snow']>2].index
DatetimeIndex(['2022-01-08 09:00:00'], dtype='datetime64[ns]', name='datetime', freq=None)
windspeed
weather_data['windspeed'].describe()
count    5831.000000
mean       17.790070
std         8.714233
min         0.000000
25%        11.200000
50%        16.500000
75%        23.700000
max        64.000000
Name: windspeed, dtype: float64
plt.figure(figsize=(10,8))
sns.histplot(data=weather_data, x="windspeed")
<AxesSubplot:xlabel='windspeed', ylabel='Count'>
Conditions
weather_data['conditions'].unique()
array(['Overcast', 'Rain, Overcast', 'Partially cloudy', 'Clear',
'Snow, Rain, Overcast', 'Snow, Overcast', 'Rain, Partially cloudy',
'Snow, Ice, Overcast', 'Snow, Partially cloudy'], dtype=object)
'''
Mapping a dict over the conditions column to transform it into a categorical variable with label encoding.
We group and label the values according to the severity of the conditions.
'''
weather_data['enc_conditions'] = weather_data['conditions'].map(lambda row: main.ConditionsMap(row)).astype('int32')
weather_data.drop('conditions', axis=1, inplace=True)
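As with the precipitation type, `ConditionsMap` is a project helper; a hypothetical sketch consistent with the three condition groups listed later in the notebook:

```python
# Hypothetical sketch of main.ConditionsMap: group the raw condition strings
# into three classes of increasing severity (dry, rain, snow/ice).
def conditions_map(value):
    groups = {
        1: {'Overcast', 'Partially cloudy', 'Clear'},
        2: {'Rain, Overcast', 'Rain, Partially cloudy', 'Snow, Partially cloudy'},
        3: {'Snow, Rain, Overcast', 'Snow, Overcast', 'Snow, Ice, Overcast'},
    }
    for label, names in groups.items():
        if value in names:
            return label
    raise ValueError(f'unknown condition: {value!r}')
```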
weather_data.head()
| temp | feelslike | humidity | precip | snow | snowdepth | windspeed | visibility | solarradiation | uvindex | severerisk | enc_preciptype | enc_conditions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | |||||||||||||
| 2022-01-01 00:00:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | NaN | NaN | NaN | 1 | 1 |
| 2022-01-01 01:00:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | NaN | NaN | NaN | 1 | 1 |
| 2022-01-01 02:00:00 | 7.2 | 6.0 | 96.49 | 0.0 | 0.0 | 0.0 | 7.3 | 4.0 | NaN | NaN | NaN | 1 | 1 |
| 2022-01-01 03:00:00 | 7.2 | 7.2 | 96.60 | 0.0 | 0.0 | 0.0 | 0.1 | 1.0 | NaN | NaN | NaN | 1 | 1 |
| 2022-01-01 04:00:00 | 6.8 | 5.4 | 99.79 | 0.0 | 0.0 | 0.0 | 7.4 | 0.0 | NaN | NaN | NaN | 1 | 1 |
Severerisk
weather_data['severerisk'].unique()
array([nan, 10., 3., 5., 30., 15., 8., 60., 75.])
'''
Since there are NaN values, we interpret them as hours without a recorded risk
and fill them with the median as a neutral value.
'''
weather_data['severerisk'].fillna(weather_data['severerisk'].median(), inplace=True)
solar radiation
weather_data['solarradiation'].min()
0.0
#weather_data[weather_data['solarradiation'].isna()==True]
weather_data[weather_data['solarradiation']==0.0]['severerisk'].unique()
array([10., 3., 5., 15., 30., 60., 8.])
weather_data[weather_data['solarradiation']==0.0]['enc_conditions'].unique()
array([1, 2, 3], dtype=int32)
weather_data[weather_data['solarradiation'].isna()==True]['severerisk'].unique()
array([10.])
'''
solarradiation is NaN only during hours without sun, which is why we fill the missing values with 0.0
'''
sorted(weather_data[weather_data['solarradiation'].isna()==True].index)
[Timestamp('2022-01-01 00:00:00'),
 Timestamp('2022-01-01 01:00:00'),
 Timestamp('2022-01-01 02:00:00'),
 ...
 Timestamp('2022-01-10 06:00:00'),
 Timestamp('2022-01-10 07:00:00')]
(140 timestamps in total, all during night-time hours between 2022-01-01 and 2022-01-10)
weather_data['solarradiation'].fillna(0.0, inplace=True)
UVindex
weather_data['uvindex'].unique()
array([nan, 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
sorted(weather_data[weather_data['uvindex'].isna()==True].index)
[Timestamp('2022-01-01 00:00:00'),
 Timestamp('2022-01-01 01:00:00'),
 Timestamp('2022-01-01 02:00:00'),
 ...
 Timestamp('2022-01-10 06:00:00'),
 Timestamp('2022-01-10 07:00:00')]
(140 timestamps in total, the same night-time hours as for solarradiation)
'''
The same as before: uvindex is missing only during hours without sun, so we fill with 0.0
'''
weather_data['uvindex'].fillna(0.0, inplace=True)
weather_data.head()
| temp | feelslike | humidity | precip | snow | snowdepth | windspeed | visibility | solarradiation | uvindex | severerisk | enc_preciptype | enc_conditions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | |||||||||||||
| 2022-01-01 00:00:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 01:00:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 02:00:00 | 7.2 | 6.0 | 96.49 | 0.0 | 0.0 | 0.0 | 7.3 | 4.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 03:00:00 | 7.2 | 7.2 | 96.60 | 0.0 | 0.0 | 0.0 | 0.1 | 1.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 04:00:00 | 6.8 | 5.4 | 99.79 | 0.0 | 0.0 | 0.0 | 7.4 | 0.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
'''
After cleaning, we again group the data into the familiar intervals, with one difference:
since the weather data are hourly, for the 15- and 30-minute intervals we assume the same
conditions within each hour (forward fill), and for the 2-hour interval we take the mean
of the two hours in each window.
'''
weather_data15 = weather_data.resample('15Min').ffill()
weather_data30 = weather_data.resample('30Min').ffill()
weather_data60 = weather_data
weather_data120 = weather_data.resample('2H').mean()
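The two resampling directions can be illustrated on a toy hourly series (the values below are made up for the example):

```python
import pandas as pd

# Toy hourly series standing in for one weather column.
idx = pd.date_range('2022-01-01', periods=4, freq='H')
hourly = pd.DataFrame({'temp': [7.8, 7.2, 7.2, 7.2]}, index=idx)

# Upsampling to 30 minutes repeats each hourly observation (forward fill).
half_hourly = hourly.resample('30Min').ffill()
# Downsampling to 2 hours averages the two hourly observations per window.
two_hourly = hourly.resample('2H').mean()
```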
weather_data15.head(10)
| temp | feelslike | humidity | precip | snow | snowdepth | windspeed | visibility | solarradiation | uvindex | severerisk | enc_preciptype | enc_conditions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | |||||||||||||
| 2022-01-01 00:00:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 00:15:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 00:30:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 00:45:00 | 7.8 | 7.8 | 92.50 | 0.0 | 0.0 | 0.0 | 0.1 | 8.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 01:00:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 01:15:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 01:30:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 01:45:00 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 02:00:00 | 7.2 | 6.0 | 96.49 | 0.0 | 0.0 | 0.0 | 7.3 | 4.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
| 2022-01-01 02:15:00 | 7.2 | 6.0 | 96.49 | 0.0 | 0.0 | 0.0 | 7.3 | 4.0 | 0.0 | 0.0 | 10.0 | 1 | 1 |
'''
It's time to merge our different dataframes
'''
picksup15 = main.MergingDataFrames(weather_data15, df15)
picksup30 = main.MergingDataFrames(weather_data30, df30)
picksup60 = main.MergingDataFrames(weather_data60, df60)
picksup120 = main.MergingDataFrames(weather_data120, df120)
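`MergingDataFrames` is another project helper; assuming it joins the two frames on their shared `DatetimeIndex`, a minimal stand-in could be:

```python
import pandas as pd

# Hypothetical stand-in for main.MergingDataFrames: an inner join on the
# shared DatetimeIndex keeps only intervals present in both frames.
def merge_frames(weather, pickups):
    return weather.join(pickups, how='inner')

idx = pd.date_range('2022-01-01', periods=3, freq='H')
merged = merge_frames(
    pd.DataFrame({'temp': [7.8, 7.2, 7.2]}, index=idx),
    pd.DataFrame({'pickups': [12, 9, 4]}, index=idx),
)
```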
'''
We create a dictionary for one-hot encoding the season and then specify the final weather
columns we are going to use, keeping those we consider the most relevant.
'''
season_month = {1:'Winter', 2:'Winter',
3:'Spring', 4:'Spring', 5:'Spring',
6:'Summer', 7:'Summer', 8:'Summer'}
weather_columns = [
'temp', 'feelslike', 'humidity',
'precip', 'snow', 'snowdepth',
'windspeed', 'visibility', 'uvindex',
'severerisk', 'enc_preciptype', 'enc_conditions'
]
# Note: reassigning the loop variable would not modify the original frames, so we
# rely on main.DataPreprocess mutating each frame in place and apply the one-hot
# encoding of the season explicitly below.
for pickup in [picksup15, picksup30, picksup60, picksup120]:
    main.DataPreprocess(pickup, season_month)
'''
Formatting the final dataset for the next visualization step: one-hot encoding the season
and creating the relevant lags.
'''
picksup15 = pd.get_dummies(picksup15, 'season')
picksup30 = pd.get_dummies(picksup30, 'season')
picksup60 = pd.get_dummies(picksup60, 'season')
picksup120 = pd.get_dummies(picksup120, 'season')
picksup15 = main.WeatherLaging(picksup15, 15, weather_columns)
picksup30 = main.WeatherLaging(picksup30, 30, weather_columns)
picksup60 = main.WeatherLaging(picksup60, 60, weather_columns)
picksup120 = main.WeatherLaging(picksup120, 120, weather_columns)
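`WeatherLaging` is also a project helper; judging from the lag column names that appear later (e.g. `lag(pickups,60-75)`), each lag looks one interval further into the 60-120 minute window before the target. A hypothetical sketch for the pickups column only, assuming the interval length divides 60 minutes (i.e. the 15/30/60-minute frames):

```python
import pandas as pd

# Hypothetical sketch of main.WeatherLaging restricted to pickups: build
# shifted copies covering the window 60-120 minutes before each row.
def add_pickup_lags(df, freq_minutes):
    steps_back = 60 // freq_minutes      # first lag starts 60 minutes ago
    n_lags = 60 // freq_minutes          # lags together cover 60-120 min back
    for i in range(n_lags):
        lo = 60 + i * freq_minutes
        df[f'lag(pickups,{lo}-{lo + freq_minutes})'] = df['pickups'].shift(steps_back + i)
    return df.dropna()

idx = pd.date_range('2022-01-01', periods=8, freq='30Min')
lagged = add_pickup_lags(pd.DataFrame({'pickups': range(8)}, index=idx), 30)
```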
The columns we have kept so far are encoded as follows:

enc_preciptype
- 1: no precipitation
- 2: rain
- 3: rain,snow
- 4: snow
- 5: freezingrain / snow,ice

enc_conditions
- 1: 'Overcast', 'Partially cloudy', 'Clear'
- 2: 'Rain, Overcast', 'Rain, Partially cloudy', 'Snow, Partially cloudy'
- 3: 'Snow, Rain, Overcast', 'Snow, Overcast', 'Snow, Ice, Overcast'
'''
We select one of the intervals, assuming the same behaviour from all of them.
'''
picksup = picksup30.copy()
picksup.describe()
| temp | feelslike | humidity | precip | snow | snowdepth | windspeed | visibility | solarradiation | uvindex | ... | workingday | holiday | month | hour | minute | season_Spring | season_Summer | season_Winter | lag(pickups,60-90) | lag(pickups,90-120) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | ... | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 | 11659.000000 |
| mean | 12.347663 | 10.589347 | 62.502053 | 0.077067 | 0.019322 | 2.161700 | 17.799580 | 14.869886 | 135.337679 | 1.339738 | ... | 0.712154 | 0.028476 | 4.527661 | 11.502788 | 14.998713 | 0.378763 | 0.378677 | 0.242559 | 197.443005 | 197.346256 |
| std | 11.057928 | 13.413959 | 20.225247 | 0.451610 | 0.142433 | 5.760949 | 8.719053 | 3.543898 | 202.113948 | 2.045488 | ... | 0.452779 | 0.166335 | 2.295053 | 6.920143 | 15.000643 | 0.485100 | 0.485078 | 0.428649 | 195.815357 | 195.601478 |
| min | -15.600000 | -26.500000 | 15.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.300000 | 0.800000 | 46.130000 | 0.000000 | 0.000000 | 0.000000 | 11.200000 | 16.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 3.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 34.000000 | 34.000000 |
| 50% | 12.900000 | 12.900000 | 61.270000 | 0.000000 | 0.000000 | 0.000000 | 16.500000 | 16.000000 | 16.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 5.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 137.000000 | 137.000000 |
| 75% | 21.600000 | 21.600000 | 79.640000 | 0.000000 | 0.000000 | 0.000000 | 23.700000 | 16.000000 | 210.000000 | 2.000000 | ... | 1.000000 | 0.000000 | 7.000000 | 17.500000 | 30.000000 | 1.000000 | 1.000000 | 0.000000 | 303.000000 | 303.000000 |
| max | 37.200000 | 39.500000 | 99.940000 | 10.546000 | 6.600000 | 36.000000 | 64.000000 | 16.000000 | 1006.000000 | 10.000000 | ... | 1.000000 | 1.000000 | 8.000000 | 23.000000 | 30.000000 | 1.000000 | 1.000000 | 1.000000 | 1225.000000 | 1225.000000 |
8 rows × 26 columns
From the correlation heatmap, at first glance we can identify the following: pickups are most significantly negatively correlated with season (winter), and less so with snowdepth and humidity. This shows how the rides react when these measures change.
plt.figure(figsize=(16,12))
sns.heatmap(picksup.corr(), annot=True)
<AxesSubplot:>
# catplot creates its own figure, so plt.figure() is not needed here
sns.catplot(data=picksup, x='workingday', y='pickups', height=8, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7fde2762dfd0>
sns.catplot(data=picksup, x='holiday', y='pickups', height=8, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7fde2762da90>
plt.figure(figsize=(16,8))
sns.boxplot(data=picksup, x='enc_conditions', y='pickups')
<AxesSubplot:xlabel='enc_conditions', ylabel='pickups'>
Dropping the last columns we concluded are not critical, and converting the categorical ones to the category dtype, we end up with the following columns:
for picks in [picksup15,picksup30,picksup60,picksup120]:
picks.drop(['usertype_Customer','usertype_Subscriber','severerisk','solarradiation','snow','minute'], axis=1, inplace=True)
picks[['enc_preciptype', 'enc_conditions','workingday', 'holiday','season_Spring', 'season_Summer',
'season_Winter','month', 'hour']] = picks[['enc_preciptype', 'enc_conditions','workingday', 'holiday','season_Spring', 'season_Summer',
'season_Winter','month', 'hour']].astype('category')
picksup15.columns
Index(['temp', 'feelslike', 'humidity', 'precip', 'snowdepth', 'windspeed',
'visibility', 'uvindex', 'enc_preciptype', 'enc_conditions', 'pickups',
'workingday', 'holiday', 'month', 'hour', 'season_Spring',
'season_Summer', 'season_Winter', 'lag(pickups,60-75)',
'lag(pickups,75-90)', 'lag(pickups,90-105)', 'lag(pickups,105-120)'],
dtype='object')
picksup15.head()
| temp | feelslike | humidity | precip | snowdepth | windspeed | visibility | uvindex | enc_preciptype | enc_conditions | ... | holiday | month | hour | season_Spring | season_Summer | season_Winter | lag(pickups,60-75) | lag(pickups,75-90) | lag(pickups,90-105) | lag(pickups,105-120) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 1.0 | 1.0 | ... | 1 | 1 | 2 | 0 | 0 | 1 | 30.0 | 29.0 | 36.0 | 30.0 |
| 9 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 1.0 | 1.0 | ... | 1 | 1 | 2 | 0 | 0 | 1 | 21.0 | 30.0 | 29.0 | 36.0 |
| 10 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 1.0 | 1.0 | ... | 1 | 1 | 2 | 0 | 0 | 1 | 22.0 | 21.0 | 30.0 | 29.0 |
| 11 | 7.2 | 6.5 | 96.49 | 0.0 | 0.0 | 5.3 | 5.1 | 0.0 | 1.0 | 1.0 | ... | 1 | 1 | 2 | 0 | 0 | 1 | 24.0 | 22.0 | 21.0 | 30.0 |
| 12 | 7.2 | 6.0 | 96.49 | 0.0 | 0.0 | 7.3 | 4.0 | 0.0 | 1.0 | 1.0 | ... | 1 | 1 | 3 | 0 | 0 | 1 | 25.0 | 24.0 | 22.0 | 21.0 |
5 rows × 22 columns
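The `lag(pickups,·)` columns carry the pickup counts from earlier intervals. As a minimal sketch of how such lags can be built on a 15-minute grid with pandas (the toy series and the lag labels below are illustrative, not the project's actual feature-engineering code):

```python
import pandas as pd

# Toy pickup counts on a 15-minute grid; a "60-75 min" lag is then shift(4),
# "75-90 min" is shift(5), and so on (labels mirror the dataset's columns).
s = pd.Series([10, 12, 9, 14, 11, 13, 15, 8], name='pickups')
frame = pd.DataFrame({'pickups': s})
for k, label in [(4, 'lag(pickups,60-75)'), (5, 'lag(pickups,75-90)'),
                 (6, 'lag(pickups,90-105)'), (7, 'lag(pickups,105-120)')]:
    frame[label] = s.shift(k)
frame = frame.dropna()  # the first rows have no full lag history
print(frame)
```

Skipping the first hour of lags (starting at 60 minutes back) keeps the features realistic for forecasting: at prediction time the most recent counts would not yet be available.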
# We reuse the same approach as before for the prediction.
def ModelRunningResults(models, models_name, first_row, picks, weather_columns):
    results = {}
    table = [first_row]
    pipe = main.PredictionPipeline(picks, weather_columns)
    fi_dict = {}
    predictions = {}
    X_train, X_test, y_train, y_test = pipe.PredictionDataPreperation()
    for model, model_name in zip(models, models_name):
        CV_score, CV_MSE, test_score, test_MSE, fi, preds = pipe.BackTestingPrediction(model, model_name, X_train, X_test, y_train, y_test)
        results[model_name] = {'Test Score': test_score, 'Test MSE': test_MSE, 'Features Importance': fi}
        table.append([model_name, test_score, test_MSE])
        fi_dict[model_name] = fi
        predictions[model_name] = preds
    return table, fi_dict, predictions, y_test['pickups']
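The backtesting logic itself lives in `main.PredictionPipeline`, which is not shown here. As a minimal sketch of the expanding-window backtest it presumably performs (the data, the model settings, and the fold count below are illustrative assumptions, not the project's actual configuration):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Illustrative data: a noisy linear signal standing in for the pickup series.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=300)

# Expanding-window backtest: each fold trains on everything before the
# cut-off and evaluates on the chunk that follows it, so no future
# information ever leaks into training.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = ExtraTreesRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print([round(s, 3) for s in scores])
```

This ordering-aware split is why the logs below report one Train/Test Score pair per fold rather than a single shuffled cross-validation score.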
models = [LinearRegression(),Ridge(),Lasso(),RandomForestRegressor(),ExtraTreesRegressor()]
models_name = ['LinearRegression','RidgeRegression','LassoRegression','RandomForestRegressor','ExtraTreesRegressor']
first_row = ['Model for 15','Final Test Score','Final Test RMSE']
weather_columns = ['feelslike', 'humidity', 'precip', 'snowdepth','windspeed', 'visibility', 'uvindex',]
print('The Prediction based on {} minutes intervals \n'.format(15))
table15, features_importance15, predictions15, y_test_values = ModelRunningResults(models,models_name, first_row, picksup15, weather_columns)
The Prediction based on 15 minutes intervals

[Backtesting log, abridged: Train/Test Score and RMSE over six expanding folds for each model. The linear models (Linear, Ridge, Lasso) hold test scores between ≈0.71 and ≈0.77 across folds; RandomForest and ExtraTrees fit the training data almost perfectly (train score ≈0.99-1.00) yet still achieve the best test scores, reaching ≈0.93 on the final fold.]
print(tabulate(table15, headers='firstrow',colalign=("left",), floatfmt=".4f"))
Model for 15           Final Test Score    Final Test RMSE
---------------------  ------------------  -----------------
LinearRegression                   0.7457            64.3816
RidgeRegression                    0.7456            64.3845
LassoRegression                    0.7320            66.0940
RandomForestRegressor              0.8535            48.8664
ExtraTreesRegressor                0.8617            47.4819
print('Features importance 15 min\n')
for model in models_name:
    print(model, '\n')
    for f in features_importance15[model]:
        print(f, '\n')
Features importance 15 min
LinearRegression
('temp', 0.6024888866301362)
('feelslike', 6.26520510397928)
('humidity', -4.608978238488977)
('precip', -0.5070179162736411)
('snowdepth', -0.5710002961629469)
('windspeed', 0.38631499547865483)
('visibility', -0.7583473971681893)
('uvindex', 1.739044628637104)
('enc_preciptype', 1.6232845323111729)
('enc_conditions', -6.112199665036933)
('workingday', 1.0192384071071545)
('holiday', -4.075833559872682)
('month', 2.60073778396347)
('hour', 2.54461359949013)
('season_Spring', 0.664187548208373)
('season_Summer', -3.1234670379213214)
('season_Winter', 2.4592794897129595)
('lag(pickups,60-75)', 94.59948453514384)
('lag(pickups,75-90)', 6.245539712008118)
('lag(pickups,90-105)', -18.425126139825988)
('lag(pickups,105-120)', -17.788378933087966)
RidgeRegression
('temp', 0.607006459235016)
('feelslike', 6.217806546727401)
('humidity', -4.60816350291144)
('precip', -0.5079885110435244)
('snowdepth', -0.5706916917461105)
('windspeed', 0.3836127201282448)
('visibility', -0.7576247433329227)
('uvindex', 1.741243579209775)
('enc_preciptype', 1.6062719144863893)
('enc_conditions', -6.090187554403827)
('workingday', 1.0205218845706623)
('holiday', -4.069837514279931)
('month', 2.599381122987555)
('hour', 2.545709357975033)
('season_Spring', 0.666023829719736)
('season_Summer', -3.120687042502916)
('season_Winter', 2.4546632130815307)
('lag(pickups,60-75)', 94.47438156563808)
('lag(pickups,75-90)', 6.321303369464182)
('lag(pickups,90-105)', -18.3701385686073)
('lag(pickups,105-120)', -17.800827519073017)
LassoRegression
('temp', 1.3837340199453547)
('feelslike', 0.0)
('humidity', -3.958300221116922)
('precip', -0.44030216641500464)
('snowdepth', -0.0)
('windspeed', 0.0)
('visibility', 0.0)
('uvindex', 2.011869148580724)
('enc_preciptype', -0.0)
('enc_conditions', -0.0)
('workingday', 0.0)
('holiday', -0.0)
('month', 0.6229354817280872)
('hour', 2.8728283264955916)
('season_Spring', 0.0)
('season_Summer', -0.0)
('season_Winter', -0.0)
('lag(pickups,60-75)', 73.23673256001294)
('lag(pickups,75-90)', 0.0)
('lag(pickups,90-105)', -0.0)
('lag(pickups,105-120)', -9.33317404321983)
RandomForestRegressor
('temp', 0.017176206471231886)
('feelslike', 0.015883821344031726)
('humidity', 0.008380482339496828)
('precip', 0.0012769759125043848)
('snowdepth', 0.0004748274706863692)
('windspeed', 0.005460302244481578)
('visibility', 0.0011652076016515206)
('uvindex', 0.01396715422235506)
('enc_preciptype', 0.00043378561410434285)
('enc_conditions', 0.00029831840004859353)
('workingday', 0.01435883488039438)
('holiday', 0.0008295834061049048)
('month', 0.007960045038326577)
('hour', 0.1506278900546934)
('season_Spring', 0.0017413257756239112)
('season_Summer', 0.0014645746557982757)
('season_Winter', 0.00016076477563864745)
('lag(pickups,60-75)', 0.7277731251743327)
('lag(pickups,75-90)', 0.010968728109446135)
('lag(pickups,90-105)', 0.008749436755323818)
('lag(pickups,105-120)', 0.010848609753724946)
ExtraTreesRegressor
('temp', 0.01619541391130922)
('feelslike', 0.017564330493834556)
('humidity', 0.005852335798036597)
('precip', 0.0013279866324129805)
('snowdepth', 0.0009026574948589324)
('windspeed', 0.005013281411851685)
('visibility', 0.0020605915570239552)
('uvindex', 0.02688778537393603)
('enc_preciptype', 0.001963059974034718)
('enc_conditions', 0.0014177269336380592)
('workingday', 0.02422888303509992)
('holiday', 0.0024272983273975657)
('month', 0.01493875137376065)
('hour', 0.17040575587115878)
('season_Spring', 0.002512976011789358)
('season_Summer', 0.01178164061194472)
('season_Winter', 0.006147163007772459)
('lag(pickups,60-75)', 0.29081551029125985)
('lag(pickups,75-90)', 0.24283072305185938)
('lag(pickups,90-105)', 0.09618087452140923)
('lag(pickups,105-120)', 0.058545254315611216)
plt.figure(figsize=(12,8))
plt.title('15-minute intervals: real vs predicted values')
plt.plot(y_test_values.values, label='Real Values')
plt.plot(predictions15['ExtraTreesRegressor'], label="Extra Tree Regressor's predicted values")
plt.legend(loc="upper left")
plt.tight_layout()
models = [LinearRegression(),Ridge(),Lasso(),RandomForestRegressor(),ExtraTreesRegressor()]
models_name = ['LinearRegression','RidgeRegression','LassoRegression','RandomForestRegressor','ExtraTreesRegressor']
first_row = ['Model for 30min','Final Test Score','Final Test RMSE']
weather_columns = ['feelslike', 'humidity', 'precip', 'snowdepth','windspeed', 'visibility', 'uvindex',]
print('The Prediction based on {} minutes intervals \n'.format(30))
table30, features_importance30, predictions30, y_test_values = ModelRunningResults(models,models_name, first_row, picksup30,weather_columns)
The Prediction based on 30 minutes intervals

[Backtesting log, abridged: Train/Test Score and RMSE over six expanding folds for each model. The linear models stay between ≈0.67 and ≈0.75 on the test score; RandomForest and ExtraTrees again fit the training data almost perfectly (train score ≈0.99-1.00) while posting the best test scores, reaching ≈0.93-0.94 on the final fold.]
print('\n\n',tabulate(table30, headers='firstrow',colalign=("left",), floatfmt=".4f"))
Model for 30min        Final Test Score    Final Test RMSE
---------------------  ------------------  -----------------
LinearRegression                   0.7110           136.5539
RidgeRegression                    0.7109           136.5715
LassoRegression                    0.7027           138.5113
RandomForestRegressor              0.8531            97.3416
ExtraTreesRegressor                0.8658            93.0674
print('Features importance 30 min')
for model in models_name:
    print(model)
    for f in features_importance30[model]:
        print(f, '\n')
Features importance 30 min
LinearRegression
('temp', 1.329676440151578)
('feelslike', 13.922778872263526)
('humidity', -10.077070357527852)
('precip', -1.3743423186592574)
('snowdepth', -1.3604580579717298)
('windspeed', 0.7741980132884955)
('visibility', -1.551082325036157)
('uvindex', 4.154268332833975)
('enc_preciptype', 3.5316671075221446)
('enc_conditions', -13.157488247304405)
('workingday', 2.416946768506123)
('holiday', -9.018115945554827)
('month', 6.078926344874121)
('hour', 5.967576069082331)
('season_Spring', 1.4632379282080612)
('season_Summer', -7.036291342679622)
('season_Winter', 5.573053414471628)
('lag(pickups,60-90)', 209.02755228539414)
('lag(pickups,90-120)', -89.1832651127256)
RidgeRegression
('temp', 1.3507389493537547)
('feelslike', 13.699986572382194)
('humidity', -10.071972029456875)
('precip', -1.3773198726098177)
('snowdepth', -1.3589199481265815)
('windspeed', 0.7614134569059257)
('visibility', -1.5489746760182053)
('uvindex', 4.1713356627477305)
('enc_preciptype', 3.457583729826424)
('enc_conditions', -13.070035485958071)
('workingday', 2.4255649230109277)
('holiday', -8.989377790661157)
('month', 6.072373070337455)
('hour', 5.975412825127189)
('season_Spring', 1.4711331935829608)
('season_Summer', -7.023588893841631)
('season_Winter', 5.552455700305834)
('lag(pickups,60-90)', 208.4519973023642)
('lag(pickups,90-120)', -88.63692337003205)
LassoRegression
('temp', 2.799923770376106)
('feelslike', 0.0)
('humidity', -9.14037908466114)
('precip', -1.7873065494990827)
('snowdepth', -0.4524396130043152)
('windspeed', -0.0)
('visibility', 0.0)
('uvindex', 4.922983625334093)
('enc_preciptype', -1.0707247393138364)
('enc_conditions', -0.0)
('workingday', 0.0)
('holiday', -0.0)
('month', 2.877928462791326)
('hour', 6.4217862151327365)
('season_Spring', 0.0)
('season_Summer', -0.0)
('season_Winter', -0.0)
('lag(pickups,60-90)', 171.53380146797505)
('lag(pickups,90-120)', -52.45636582571057)
RandomForestRegressor
('temp', 0.030575943611822677)
('feelslike', 0.026237392210002305)
('humidity', 0.009366156150003744)
('precip', 0.0015273487364883384)
('snowdepth', 0.0004978499158417112)
('windspeed', 0.005836695928534466)
('visibility', 0.0015022601400996752)
('uvindex', 0.008228431590348207)
('enc_preciptype', 0.0005754369073993471)
('enc_conditions', 0.00046866168771488525)
('workingday', 0.020581196961179535)
('holiday', 0.0011124043538061565)
('month', 0.024540178865352945)
('hour', 0.2019491256339715)
('season_Spring', 0.001631200233737006)
('season_Summer', 0.0012923168555575382)
('season_Winter', 0.0008177051024073054)
('lag(pickups,60-90)', 0.6467573951200805)
('lag(pickups,90-120)', 0.01650229999565218)
ExtraTreesRegressor
('temp', 0.02728681632033958)
('feelslike', 0.03485031841144097)
('humidity', 0.006397038909947372)
('precip', 0.001445567887505193)
('snowdepth', 0.000998613330413861)
('windspeed', 0.0052795383880636405)
('visibility', 0.0023278269908161477)
('uvindex', 0.034962179324092804)
('enc_preciptype', 0.0025670229464792)
('enc_conditions', 0.0017292467641555084)
('workingday', 0.030132909011356564)
('holiday', 0.002698755977555149)
('month', 0.034503259873879026)
('hour', 0.25275387512859326)
('season_Spring', 0.004012333128628575)
('season_Summer', 0.01030654080634168)
('season_Winter', 0.02412076318258276)
('lag(pickups,60-90)', 0.3718531814509455)
('lag(pickups,90-120)', 0.15177421216686338)
plt.figure(figsize=(12,8))
plt.title('30-minute intervals: real vs predicted values')
plt.plot(y_test_values.values, label='Real Values')
plt.plot(predictions30['ExtraTreesRegressor'], label="Extra Tree Regressor's predicted values")
plt.legend(loc="upper left")
plt.tight_layout()
models = [LinearRegression(),Ridge(),Lasso(),RandomForestRegressor(),ExtraTreesRegressor()]
models_name = ['LinearRegression','RidgeRegression','LassoRegression','RandomForestRegressor','ExtraTreesRegressor']
first_row = ['Model for 60','Final Test Score','Final Test RMSE']
weather_columns = ['feelslike', 'humidity', 'precip', 'snowdepth','windspeed', 'visibility', 'uvindex',]
print('The Prediction based on {} minutes intervals \n'.format(60))
table60, features_importance60, predictions60, y_test_values = ModelRunningResults(models,models_name, first_row, picksup60,weather_columns)
The Prediction based on 60 minutes intervals

[Backtesting log, abridged: Train/Test Score and RMSE over six expanding folds for each model. The linear models range between ≈0.60 and ≈0.72 on the test score; ExtraTrees fits the training data exactly (train score 1.0, train RMSE 0.0) and RandomForest nearly so, yet both still reach the best test scores, peaking at ≈0.94-0.95 on the final fold.]
print(tabulate(table60,headers='firstrow',colalign=("left",), floatfmt=".4f"))
Model for 60           Final Test Score    Final Test RMSE
---------------------  ------------------  -----------------
LinearRegression                   0.6159           311.6947
RidgeRegression                    0.6159           311.7024
LassoRegression                    0.6145           312.2669
RandomForestRegressor              0.8635           185.7917
ExtraTreesRegressor                0.8664           183.8122
for model in models_name:
    print(model)
    for f in features_importance60[model]:
        print(f, '\n')
LinearRegression
('temp', 4.156368005610714)
('feelslike', 19.685509046954916)
('humidity', -21.767241089839718)
('precip', -2.6836484973099437)
('snowdepth', -3.351707292266602)
('windspeed', 0.5465150309496023)
('visibility', -4.089720798550904)
('uvindex', 17.15704725256907)
('enc_preciptype', 7.053034312982888)
('enc_conditions', -36.14536400355487)
('workingday', 9.0607814809872)
('holiday', -18.898029980931764)
('month', 15.062909212476457)
('hour', 17.82101776307295)
('season_Spring', 3.189645767043337)
('season_Summer', -17.213933985473762)
('season_Winter', 14.024288218430362)
('lag(pickups,60-120)', 208.00118264150194)
RidgeRegression
('temp', 4.207516024999784)
('feelslike', 19.15219246228866)
('humidity', -21.76233615001592)
('precip', -2.701631490219897)
('snowdepth', -3.3438429065260555)
('windspeed', 0.5157370704350018)
('visibility', -4.067181922401089)
('uvindex', 17.14136352504782)
('enc_preciptype', 6.690138502387665)
('enc_conditions', -35.62776781801748)
('workingday', 9.058654162998021)
('holiday', -18.790381798785486)
('month', 15.028347565671263)
('hour', 17.8232921738074)
('season_Spring', 3.215283970774547)
('season_Summer', -17.125806470770904)
('season_Winter', 13.910522499841429)
('lag(pickups,60-120)', 207.928161664016)
LassoRegression
('temp', 6.201569139097055)
('feelslike', 0.0)
('humidity', -19.802150465848364)
('precip', -3.2232892048893693)
('snowdepth', -1.898658414789273)
('windspeed', -0.0)
('visibility', -0.0)
('uvindex', 16.446324162371155)
('enc_preciptype', -0.0)
('enc_conditions', -16.422653002963763)
('workingday', 4.13357401559078)
('holiday', -0.0)
('month', 8.552859545244042)
('hour', 17.85740158317223)
('season_Spring', 3.8995127923317052)
('season_Summer', -0.0)
('season_Winter', -0.0)
('lag(pickups,60-120)', 206.5926344464154)
RandomForestRegressor
('temp', 0.02635819571519048)
('feelslike', 0.024279040124330634)
('humidity', 0.009442080741672153)
('precip', 0.0014293666419875365)
('snowdepth', 0.0005634456563335194)
('windspeed', 0.0059727507857446375)
('visibility', 0.0018178622748444085)
('uvindex', 0.006557716551028671)
('enc_preciptype', 0.000616221372649643)
('enc_conditions', 0.0004629807122680295)
('workingday', 0.02106815610008448)
('holiday', 0.0006987529101955144)
('month', 0.050183291186634366)
('hour', 0.25327388364135833)
('season_Spring', 0.0016023175925820863)
('season_Summer', 0.0013296905958063187)
('season_Winter', 0.0005477411070474009)
('lag(pickups,60-120)', 0.5937965062902417)
ExtraTreesRegressor
('temp', 0.05416949349687749)
('feelslike', 0.04473042388588976)
('humidity', 0.008184866457713255)
('precip', 0.0014596821895102553)
('snowdepth', 0.001300009673170664)
('windspeed', 0.005621616844478159)
('visibility', 0.0029748151583535065)
('uvindex', 0.040561546693719676)
('enc_preciptype', 0.003193681056789635)
('enc_conditions', 0.002711116893801144)
('workingday', 0.037514136593258626)
('holiday', 0.0031573177751395096)
('month', 0.057991034760954484)
('hour', 0.3236522064877543)
('season_Spring', 0.005262268948260527)
('season_Summer', 0.013401580242393523)
('season_Winter', 0.029917407534564237)
('lag(pickups,60-120)', 0.36419679530737126)
plt.figure(figsize=(12,8))
plt.title('60-minute intervals: real vs predicted values')
plt.plot(y_test_values.values, label='Real Values')
plt.plot(predictions60['ExtraTreesRegressor'], label="Extra Tree Regressor's predicted values")
plt.legend(loc="upper left")
plt.tight_layout()
models = [LinearRegression(),Ridge(),Lasso(),RandomForestRegressor(),ExtraTreesRegressor()]
models_name = ['LinearRegression','RidgeRegression','LassoRegression','RandomForestRegressor','ExtraTreesRegressor']
first_row = ['Model for 120','Final Test Score','Final Test RMSE']
weather_columns = ['feelslike', 'humidity', 'precip', 'snowdepth','windspeed', 'visibility', 'uvindex',]
print('The Prediction based on {} minutes intervals \n'.format(120))
table120, features_importance120, predictions120, y_test_values = ModelRunningResults(models,models_name, first_row, picksup120,weather_columns)
The Prediction based on 120 minutes intervals

[Backtesting log, abridged: Train/Test Score and RMSE over six expanding folds for each model. The linear models range between ≈0.69 and ≈0.79 on the test score; ExtraTrees fits the training data exactly (train score 1.0, train RMSE 0.0) and RandomForest nearly so, with ExtraTrees reaching ≈0.96 on the final fold.]
print('\n\n',tabulate(table120, headers='firstrow',colalign=("left",), floatfmt=".4f"))
Model for 120            Final Test Score    Final Test RMSE
---------------------  ------------------  -----------------
LinearRegression                   0.6982           533.8585
RidgeRegression                    0.6982           533.8845
LassoRegression                    0.6974           534.5975
RandomForestRegressor              0.8574           367.0083
ExtraTreesRegressor                0.8818           334.1423
print('Features importance 120 min')
for model in models_name:
    print(model)
    for f in features_importance120[model]:
        print(f, '\n')
Features importance 120 min
LinearRegression
('temp', 7.715164523087931)
('feelslike', 16.839300774010205)
('humidity', -27.055983569360677)
('precip', 1.4076929806000404)
('snowdepth', -6.246740715328282)
('windspeed', -0.5837699342582109)
('visibility', -3.2351870283244404)
('uvindex', 46.74277072022949)
('enc_preciptype', 35.39206641794493)
('enc_conditions', -98.07723526647614)
('workingday', 18.475360485487506)
('holiday', -26.841653556261768)
('month', 24.95088358461382)
('hour', 35.45766330076451)
('season_Spring', 4.039215346444744)
('season_Summer', -26.97762164808973)
('season_Winter', 22.93840630164495)
('lag(pickups,0-120)', 465.60801906019265)
RidgeRegression
('temp', 7.8057434595585375)
('feelslike', 15.921880507175306)
('humidity', -27.117227988416147)
('precip', 1.2396746568369312)
('snowdepth', -6.216702131938699)
('windspeed', -0.6445665538109173)
('visibility', -3.1108948282137625)
('uvindex', 46.60755819695812)
('enc_preciptype', 32.5191562591084)
('enc_conditions', -93.83392968085636)
('workingday', 18.455036916966176)
('holiday', -26.57652330105566)
('month', 24.902172627927566)
('hour', 35.46779342430362)
('season_Spring', 4.089308332223412)
('season_Summer', -26.706496737305038)
('season_Winter', 22.61718840501132)
('lag(pickups,0-120)', 465.2249928486695)
LassoRegression
('temp', 9.384733932697955)
('feelslike', 0.0)
('humidity', -25.60208294722853)
('precip', -0.7278516441006497)
('snowdepth', -4.075241828891246)
('windspeed', -0.21729379167732446)
('visibility', 0.0)
('uvindex', 45.87689155920444)
('enc_preciptype', -0.0)
('enc_conditions', -40.96190844831285)
('workingday', 13.71136786653646)
('holiday', -0.0)
('month', 16.61916503439068)
('hour', 35.46574280941973)
('season_Spring', 3.045602561469723)
('season_Summer', -3.9058095922992067)
('season_Winter', -0.0)
('lag(pickups,0-120)', 464.1782619054718)
RandomForestRegressor
('temp', 0.038341193359478526)
('feelslike', 0.028653729928481475)
('humidity', 0.007327234987434809)
('precip', 0.0015077319863973943)
('snowdepth', 0.000546654094840015)
('windspeed', 0.005450860048314915)
('visibility', 0.001419269468664589)
('uvindex', 0.01216617040231617)
('enc_preciptype', 0.00043345701447008035)
('enc_conditions', 0.0004958129893287809)
('workingday', 0.01340360183703311)
('holiday', 0.0004498123440239543)
('month', 0.015321331634647219)
('hour', 0.2062772757953746)
('season_Spring', 0.0013631448615807555)
('season_Summer', 0.0011486612542913682)
('season_Winter', 0.0025753330379143925)
('lag(pickups,0-120)', 0.6631187249554079)
ExtraTreesRegressor
('temp', 0.03497236708231175)
('feelslike', 0.03789535229307001)
('humidity', 0.004898947339078277)
('precip', 0.002411955762450693)
('snowdepth', 0.001175733117201994)
('windspeed', 0.00414065871311306)
('visibility', 0.002798378854138784)
('uvindex', 0.03323665160819293)
('enc_preciptype', 0.002145941993493344)
('enc_conditions', 0.0016189245970395288)
('workingday', 0.028847511838544237)
('holiday', 0.002036244589979809)
('month', 0.052038512216927416)
('hour', 0.3062013794918345)
('season_Spring', 0.006543416129544077)
('season_Summer', 0.008912136662068335)
('season_Winter', 0.04037773333069405)
('lag(pickups,0-120)', 0.42974815438031727)
plt.figure(figsize=(12,8))
plt.title('120 Minutes real VS predicted values')
plt.plot(y_test_values.values, label='Real Values')
plt.plot(predictions120['ExtraTreesRegressor'], label="Extra Tree Regressor's predicted values")
plt.legend(loc="upper left")
plt.tight_layout()
As before, the best algorithms are the non-linear ensemble models, and they also perform slightly better than in the previous experiment. The backtesting strategy confirms our predictions, and the small amount of overfitting remains. The most important feature for both the linear and the non-linear models is clearly the lagged pickup count of the interval, with the hour of day following. However, the linear models appear to be more sensitive to the weather columns such as uvindex, humidity and conditions, and this sensitivity is smoother at the smaller time intervals than at the larger ones. All in all, the champion model is again the ExtraTreesRegressor, with the lowest RMSE and the highest score; the linear models capture the trend better at the smaller intervals, while the ensembles are slightly better at the larger ones.
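Comparing linear coefficients with impurity-based importances is tricky, since they live on different scales. A model-agnostic alternative is scikit-learn's `permutation_importance`, sketched here on synthetic data (the feature layout and numbers are illustrative, not our actual columns):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# feature 0 dominates the target, mimicking the lag feature in our data
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
# shuffle each feature in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

The same call would work on any of the fitted models above, giving one comparable importance scale across linear and ensemble regressors.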
And as promised, let's start the most difficult part of the project: prediction at the level of station clusters. In this part we try to identify possible clusters of bike stations and predict the rides per cluster. We won't dive too deep, just enough to leave something for the future :)
clustering_data = pd.read_csv('BikeSharing_Bluebikes2022.csv', index_col=0)
# transforming the dates from object to datetime
for date_column in ['starttime', 'stoptime']:
    clustering_data[date_column] = pd.to_datetime(clustering_data[date_column], format='%Y-%m-%d %H:%M:%S')
We use the elbow method to choose the number of station clusters. In cluster analysis, the elbow curve is a heuristic for determining the number of clusters in a data set: plot the explained variation as a function of the number of clusters and pick the elbow of the curve as the number of clusters to use.
K_clusters = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
# cluster on both coordinates of the start stations
coords = clustering_data.loc[:, ['start station latitude', 'start station longitude']].values
score = [km.fit(coords).score(coords) for km in kmeans]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
Based on the elbow curve, we select 5 clusters, as this is approximately where the curve flattens; the optimal range appears to be 5-7 clusters.
# create 5 clusters using the k-means clustering algorithm
kmeans = KMeans(n_clusters=5, init='k-means++')
coords = clustering_data.loc[:, ['start station latitude', 'start station longitude']]
clustering_data['cluster_label'] = kmeans.fit_predict(coords)  # fit once and label every ride
centers = kmeans.cluster_centers_  # coordinates of the cluster centers
labels = kmeans.labels_            # label of each point
The number of rides per cluster.
clustering_data['cluster_label'].value_counts()
1    647412
0    640150
2    550503
3    463484
4      4186
Name: cluster_label, dtype: int64
clustering_data = clustering_data.join(clustering_data['cluster_label'].value_counts(), on='cluster_label', lsuffix='', rsuffix='count')
clustering_data.head()
| tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | postal code | cluster_label | cluster_labelcount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 597 | 2022-01-01 00:00:25.166 | 2022-01-01 00:10:22.192 | 178 | MIT Pacific St at Purrington St | 42.359573 | -71.101295 | 74 | Harvard Square at Mass Ave/ Dunster | 42.373268 | -71.118579 | 4923 | Subscriber | 02139 | 1 | 647412 |
| 1 | 411 | 2022-01-01 00:00:40.430 | 2022-01-01 00:07:32.198 | 189 | Kendall T | 42.362428 | -71.084955 | 178 | MIT Pacific St at Purrington St | 42.359573 | -71.101295 | 3112 | Subscriber | 02139 | 1 | 647412 |
| 2 | 476 | 2022-01-01 00:00:54.818 | 2022-01-01 00:08:51.668 | 94 | Main St at Austin St | 42.375603 | -71.064608 | 356 | Charlestown Navy Yard | 42.374125 | -71.054812 | 6901 | Customer | 02124 | 0 | 640150 |
| 3 | 466 | 2022-01-01 00:01:01.608 | 2022-01-01 00:08:48.235 | 94 | Main St at Austin St | 42.375603 | -71.064608 | 356 | Charlestown Navy Yard | 42.374125 | -71.054812 | 5214 | Customer | 02124 | 0 | 640150 |
| 4 | 752 | 2022-01-01 00:01:06.052 | 2022-01-01 00:13:38.230 | 19 | Park Dr at Buswell St | 42.347241 | -71.105301 | 41 | Packard's Corner - Commonwealth Ave at Brighto... | 42.352261 | -71.123831 | 2214 | Subscriber | 02215 | 2 | 550503 |
clustering_data.columns
Index(['tripduration', 'starttime', 'stoptime', 'start station id',
'start station name', 'start station latitude',
'start station longitude', 'end station id', 'end station name',
'end station latitude', 'end station longitude', 'bikeid', 'usertype',
'postal code', 'cluster_label', 'cluster_labelcount'],
dtype='object')
A first visualization of the clusters, without a map layout.
plt.figure(figsize=(16,12))
plt.scatter(x='start station latitude', y='start station longitude', data=clustering_data, c=labels, s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)  # cluster centers
The second and much better visualization uses the map of Massachusetts as a layout. For memory reasons we have plotted only a sample of the data. Zooming into the city, the clustering looks reasonable.
plt.figure(figsize=(16,12))
main.get_location_interactive(clustering_data[['start station latitude','start station longitude','cluster_label', 'cluster_labelcount']].sample(n=100000))
There is also a small cluster up in Salem which the plot could not render well, probably because of the small sample of the data we have plotted.
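A rare cluster like this one can be inspected directly instead of relying on the plot. A minimal sketch on a hypothetical mini-DataFrame (the station names are illustrative; in the notebook the same two lines would run on `clustering_data`):

```python
import pandas as pd

rides = pd.DataFrame({
    'start station name': ['A', 'A', 'B', 'C', 'C', 'Salem Depot'],
    'cluster_label':      [0,   0,   1,   1,   1,   4],
})

# find the rarest cluster and list the stations it contains
smallest = rides['cluster_label'].value_counts().idxmin()
stations = rides.loc[rides['cluster_label'] == smallest, 'start station name'].unique()
```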
plt.figure(figsize=(16,12))
main.get_location_interactive(clustering_data[['start station latitude','start station longitude','cluster_label', 'cluster_labelcount']].sample(n=100000))
cdf15 = main.DatetimeInterval(clustering_data, freq='15Min')
cdf30 = main.DatetimeInterval(clustering_data, freq='30Min')
cdf60 = main.DatetimeInterval(clustering_data, freq='60Min')
cdf120 = main.DatetimeInterval(clustering_data, freq='120Min')
cdf120.head()
| usertype_Customer | usertype_Subscriber | cluster_label_0 | cluster_label_1 | cluster_label_2 | cluster_label_3 | cluster_label_4 | pickups | |
|---|---|---|---|---|---|---|---|---|
| starttime | ||||||||
| 2022-01-01 00:00:00 | 59.0 | 158.0 | 68.0 | 61.0 | 60.0 | 28.0 | 0.0 | 217.0 |
| 2022-01-01 02:00:00 | 86.0 | 82.0 | 57.0 | 39.0 | 48.0 | 24.0 | 0.0 | 168.0 |
| 2022-01-01 04:00:00 | 111.0 | 104.0 | 108.0 | 38.0 | 48.0 | 21.0 | 0.0 | 215.0 |
| 2022-01-01 06:00:00 | 189.0 | 94.0 | 111.0 | 51.0 | 71.0 | 50.0 | 0.0 | 283.0 |
| 2022-01-01 08:00:00 | 38.0 | 18.0 | 18.0 | 12.0 | 13.0 | 13.0 | 0.0 | 56.0 |
We need to transform the data into a format our models can use. Selecting only the 60-minute interval, we apply a new type of model, the multi-output regressor.
df = cdf60.copy()
minutes = 60
df['month'] = df.index.month
df['hour'] = df.index.hour
df['minute'] = df.index.minute
clusters_columns = [c for c in df.columns if c.startswith("cluster_label")]
for ind, c in enumerate(clusters_columns):
    for i in range(int(((120/minutes)/2) + 1), int((120/minutes) + 1)):
        df['lag(pickups of cluster_label_{},{}-{})'.format(ind, i*minutes-minutes, i*minutes)] = df[c].shift(i)
df = df.dropna().drop(['usertype_Customer','usertype_Subscriber'], axis=1)
This is the final format of our dataset: columns with the per-cluster pickups occurring during the index datetime, plus lag columns holding the pickups from two hours earlier. Based on this, we create our train and test sets.
df.head()
| cluster_label_0 | cluster_label_1 | cluster_label_2 | cluster_label_3 | cluster_label_4 | pickups | month | hour | minute | lag(pickups of cluster_label_0,60-120) | lag(pickups of cluster_label_1,60-120) | lag(pickups of cluster_label_2,60-120) | lag(pickups of cluster_label_3,60-120) | lag(pickups of cluster_label_4,60-120) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| starttime | ||||||||||||||
| 2022-01-01 02:00:00 | 31.0 | 20.0 | 12.0 | 10.0 | 0.0 | 73.0 | 1 | 2 | 0 | 45.0 | 39.0 | 24.0 | 17.0 | 0.0 |
| 2022-01-01 03:00:00 | 26.0 | 19.0 | 36.0 | 14.0 | 0.0 | 95.0 | 1 | 3 | 0 | 23.0 | 22.0 | 36.0 | 11.0 | 0.0 |
| 2022-01-01 04:00:00 | 46.0 | 18.0 | 18.0 | 9.0 | 0.0 | 91.0 | 1 | 4 | 0 | 31.0 | 20.0 | 12.0 | 10.0 | 0.0 |
| 2022-01-01 05:00:00 | 62.0 | 20.0 | 30.0 | 12.0 | 0.0 | 124.0 | 1 | 5 | 0 | 26.0 | 19.0 | 36.0 | 14.0 | 0.0 |
| 2022-01-01 06:00:00 | 66.0 | 26.0 | 43.0 | 39.0 | 0.0 | 174.0 | 1 | 6 | 0 | 46.0 | 18.0 | 18.0 | 9.0 | 0.0 |
X = df.drop(['pickups'], axis=1)
y = df[['month','hour','cluster_label_0','cluster_label_1','cluster_label_2','cluster_label_3','cluster_label_4']]
X_train = X[X['month']!=8].drop(['cluster_label_0','cluster_label_1','cluster_label_2','cluster_label_3','cluster_label_4'],axis=1)
X_test = X[X['month']==8].drop(['cluster_label_0','cluster_label_1','cluster_label_2','cluster_label_3','cluster_label_4'],axis=1)
y_train = y[y['month']!=8][['cluster_label_0','cluster_label_1','cluster_label_2','cluster_label_3','cluster_label_4']]
y_test = y[y['month']==8][['cluster_label_0','cluster_label_1','cluster_label_2','cluster_label_3','cluster_label_4']]
This time we use a special type of regressor, sklearn's MultiOutputRegressor, which performs multi-target regression. The strategy consists of fitting one regressor per target; it is a simple way of extending regressors that do not natively support multi-target regression. For the prediction we test a linear model against a non-linear one: LinearRegression() vs ExtraTreesRegressor().
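The "one regressor per target" strategy can be seen directly on synthetic data (the shapes are illustrative, not our dataset's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.random((50, 4))
Y = rng.random((50, 3))   # three target columns, like three clusters

mor = MultiOutputRegressor(LinearRegression()).fit(X, Y)
n_models = len(mor.estimators_)  # one fitted LinearRegression per target column
```

`mor.predict(X)` then returns one column per target, which is exactly the shape we need for the per-cluster pickups.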
MORLinear = MultiOutputRegressor(LinearRegression())
MORETress = MultiOutputRegressor(ExtraTreesRegressor())
MORLinear.fit(X_train,y_train)
predsLinear = MORLinear.predict(X_test)
print('MultiOutputRegressor(LinearRegression)')
print(MORLinear.score(X_test,y_test.values))
print(np.sqrt(mean_squared_error(predsLinear,y_test.values)))
MultiOutputRegressor(LinearRegression)
0.5248864922096287
73.07309039238756
fig, axs = plt.subplots(len(y_test.columns), figsize=(16,12))
for i in range(len(y_test.columns)):
    axs[i].scatter(predsLinear[:, i], y_test.iloc[:, i].values)  # predicted vs real, per cluster
    axs[i].set_title(y_test.columns[i])
plt.tight_layout()
fig, axs = plt.subplots(len(y_test.columns), figsize=(16,12))
for i in range(len(y_test.columns)):
    axs[i].plot(y_test.iloc[:, i].values)  # real values
    axs[i].plot(predsLinear[:, i])         # predicted values
    axs[i].set_title(y_test.columns[i])
plt.tight_layout()
MORETress.fit(X_train,y_train)
predsTress = MORETress.predict(X_test)
print('MultiOutputRegressor(ExtraTreesRegressor)')
print(MORETress.score(X_test,y_test.values))
print(np.sqrt(mean_squared_error(predsTress,y_test.values)))
MultiOutputRegressor(ExtraTreesRegressor)
0.6718506549018661
48.790495913706145
fig, axs = plt.subplots(len(y_test.columns), figsize=(16,12))
fig.suptitle('60 minutes intervals')
for i in range(len(y_test.columns)):
    axs[i].scatter(predsTress[:, i], y_test.iloc[:, i].values)  # predicted vs real, per cluster
    axs[i].set_title(y_test.columns[i])
plt.tight_layout()
fig, axs = plt.subplots(len(y_test.columns), figsize=(16,12))
fig.suptitle('60 minutes intervals')
for i in range(len(y_test.columns)):
    axs[i].plot(y_test.iloc[:, i].values)  # real values
    axs[i].plot(predsTress[:, i])          # predicted values
    axs[i].set_title(y_test.columns[i])
plt.tight_layout()
In a nutshell, this prediction did not make us much wiser: we did not find a model with a truly good prediction score, although the R² of about 0.672 from MultiOutputRegressor(ExtraTreesRegressor) does reach the benchmark we were asked for in the project description. As we can see, the models fail to follow the changes after the point where the rides rise sharply, and on top of that none of them can predict cluster 4 at all, the cluster with very few rides, the smallest of all. To sum up, the models have potential if they are combined with more data, such as weather data and additional observations, and with proper parameter tuning.
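The overall score hides which clusters fail: `r2_score` with `multioutput='raw_values'` reports one R² per target instead of their average. A sketch with dummy arrays standing in for `y_test.values` and `predsTress` (the numbers are made up to mimic one well-fit cluster and one poorly-fit rare cluster):

```python
import numpy as np
from sklearn.metrics import r2_score

# two targets: a busy cluster and a rare, near-zero cluster like cluster 4
y_true = np.array([[10, 0], [20, 0], [30, 1], [40, 0]])
y_pred = np.array([[11, 0.5], [19, 0.5], [29, 0.5], [41, 0.5]])

per_cluster = r2_score(y_true, y_pred, multioutput='raw_values')
# the first target is fit well, the second is worse than predicting its mean
```

Running the same call on the real test predictions would quantify exactly how badly each model handles cluster 4.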